GENETIC COMPOSITIONS AND METHODS

BACKGROUND OF THE INVENTION 
The genomes of all organisms undergo spontaneous
mutation in the course of their continuing evolution,
generating variant forms of progenitor sequences (Gusella,
Ann. Rev. Biochem. 55, 831-854 (1986)). The variant form may
confer an evolutionary advantage or disadvantage relative to a
progenitor form or may be neutral. In some instances, a
variant form confers a lethal disadvantage and is not
transmitted to subsequent generations of the organism. In
other instances, a variant form confers an evolutionary
advantage to the species and is eventually incorporated into
the DNA of many or most members of the species and effectively
becomes the progenitor form. In many instances, both
progenitor and variant form(s) survive and co-exist in a
species population. The coexistence of multiple forms of a
sequence gives rise to polymorphisms.

Several different types of polymorphism have been
reported. A restriction fragment length polymorphism (RFLP)
means a variation in DNA sequence that alters the length of a
restriction fragment as described in Botstein et al., Am. J.
Hum. Genet. 32, 314-331 (1980). The restriction fragment
length polymorphism may create or delete a restriction site,
thus changing the length of the restriction fragment. RFLPs
have been widely used in human and animal genetic analyses
(see WO 90/13668; WO 90/11369; Donis-Keller, Cell 51, 319-337
(1987); Lander et al., Genetics 121, 85-99 (1989)). When a
heritable trait can be linked to a particular RFLP, the
presence of the RFLP in an individual can be used to predict
the likelihood that the animal will also exhibit the trait.

Other polymorphisms take the form of short tandem
repeats (STRs) that include tandem di-, tri- and
tetranucleotide repeated motifs. These tandem repeats are also
referred to as variable number tandem repeat (VNTR)
polymorphisms. VNTRs have been used in identity and paternity
analysis (US 5,075,217; Armour et al., FEBS Lett. 307, 113-115
(1992); Horn et al., WO 91/14003; Jeffreys, EP 370,719), and
in a large number of genetic mapping studies.

Other polymorphisms take the form of single nucleotide
variations between individuals of the same species. Such
polymorphisms are far more frequent than RFLPs, STRs and
VNTRs. Some single nucleotide polymorphisms occur in
protein-coding sequences, in which case one of the polymorphic
forms may give rise to the expression of a defective or other
variant protein and, potentially, a genetic disease. Examples
of genes in which polymorphisms within coding sequences give
rise to genetic disease include β-globin (sickle cell anemia)
and CFTR (cystic fibrosis). Other single nucleotide
polymorphisms occur in noncoding regions. Some of these
polymorphisms may also result in defective protein expression
(e.g., as a result of defective splicing). Other single
nucleotide polymorphisms have no phenotypic effects.

Single nucleotide polymorphisms can be used in the
same manner as RFLPs and VNTRs but offer several advantages.
Single nucleotide polymorphisms occur with greater frequency
and are spaced more uniformly throughout the genome than other
forms of polymorphism. The greater frequency and uniformity
of single nucleotide polymorphisms mean that there is a
greater probability that such a polymorphism will be found in
close proximity to a genetic locus of interest than would be
the case for other polymorphisms. Also, the different forms
of characterized single nucleotide polymorphisms are often
easier to distinguish than other types of polymorphism (e.g.,
by use of assays employing allele-specific hybridization
probes or primers).

Despite the increased amount of nucleotide sequence
data being generated in recent years, only a minute proportion
of the total repository of polymorphisms in humans and other
organisms has so far been identified. The paucity of
polymorphisms hitherto identified is due to the large amount
of work required for their detection by conventional methods.
For example, a conventional approach to identifying
polymorphisms might be to sequence the same stretch of
oligonucleotides in a population of individuals by dideoxy
sequencing. In this type of approach, the amount of work
increases in proportion to both the length of sequence and the
number of individuals in a population and becomes impractical
for large stretches of DNA or large numbers of persons.

SUMMARY OF THE CLAIMED INVENTION 
The invention provides nucleic acid segments of
between 10 and 100 bases from a fragment shown in Table 1,
column 1 including a polymorphic site. Complements of these
segments are also included. The segments can be DNA or RNA,
and can be double- or single-stranded. Some segments are 10-20
or 10-50 bases long. Preferred segments include a
diallelic polymorphic site. The base occupying the
polymorphic site in the segments can be the reference (Table
1, column 3) or an alternative base (Table 1, column 5).

The invention further provides allele-specific
oligonucleotides that hybridize to a segment of a fragment
shown in Table 1, column 8 or its complement. These
oligonucleotides can be probes or primers. Also provided are
isolated nucleic acids comprising a sequence of Table 1,
column 8, or the complement thereto, in which the polymorphic
site within the sequence is occupied by a base other than the
reference base shown in Table 1, column 3.

The invention further provides a method of analyzing a
nucleic acid from an individual. The method determines which
base is present at any one of the polymorphic sites shown in
Table 1. Optionally, a set of bases occupying a set of the
polymorphic sites shown in Table 1 is determined. This type
of analysis can be performed on a plurality of individuals who
are tested for the presence of a disease phenotype. The
presence or absence of disease phenotype can then be
correlated with a base or set of bases present at the
polymorphic sites in the individuals tested.

The invention further provides a computer-readable
storage medium for storing data for access by an application
program being executed on a data processing system. Such a
medium comprises a data structure stored in the
computer-readable storage medium, the data structure including
information resident in a database used by the application
program. The data structure includes a plurality of records,
each record of the plurality comprising information
identifying a polymorphism shown in Table 1.

The invention further provides a signal carrying data 
for access by an application program being executed on a data 
processing system. A data structure is encoded in the signal. 
The data structure includes information resident in a database 
used by the application program. Such information includes a 
plurality of records, each record of the plurality comprising 
information identifying a polymorphism shown in Table 1. 

BRIEF DESCRIPTION OF THE FIGURES 
Figs. 1A and 1B depict computer systems suitable for
storing and transmitting information relating to the
polymorphisms of the invention.

DEFINITIONS 

An oligonucleotide can be DNA or RNA, and single- or
double-stranded. Oligonucleotides can be naturally occurring
or synthetic, but are typically prepared by synthetic means.
Preferred oligonucleotides of the invention include segments
of DNA, or their complements, including any one of the
polymorphic sites shown in Table 1. The segments are usually
between 5 and 100 bases, and often between 5-10, 5-20, 10-20,
10-50, 15-50, 15-100, 20-50 or 20-100 bases. The polymorphic
site can occur within any position of the segment. The
segments can be from any of the allelic forms of DNA shown in
Table 1.

Hybridization probes are oligonucleotides capable of
binding in a base-specific manner to a complementary strand of
nucleic acid. Such probes include peptide nucleic acids, as
described in Nielsen et al., Science 254, 1497-1500 (1991).

The term primer refers to a single-stranded
oligonucleotide capable of acting as a point of initiation of
template-directed DNA synthesis under appropriate conditions
(i.e., in the presence of four different nucleoside
triphosphates and an agent for polymerization, such as DNA or
RNA polymerase or reverse transcriptase) in an appropriate
buffer and at a suitable temperature. The appropriate length
of a primer depends on the intended use of the primer but
typically ranges from 15 to 30 nucleotides. Short primer
molecules generally require cooler temperatures to form
sufficiently stable hybrid complexes with the template. A
primer need not reflect the exact sequence of the template but
must be sufficiently complementary to hybridize with a
template. The term primer site refers to the area of the
target DNA to which a primer hybridizes. The term primer pair
means a set of primers including a 5' upstream primer that
hybridizes with the 5' end of the DNA sequence to be amplified
and a 3' downstream primer that hybridizes with the
complement of the 3' end of the sequence to be amplified.

Linkage describes the tendency of genes, alleles, loci
or genetic markers to be inherited together as a result of
their location on the same chromosome, and can be measured by
percent recombination between the two genes, alleles, loci or
genetic markers.

Polymorphism refers to the occurrence of two or more
genetically determined alternative sequences or alleles in a
population. A polymorphic marker or site is the locus at
which divergence occurs. Preferred markers have at least two
alleles, each occurring at a frequency of greater than 1%, and
more preferably greater than 10% or 20% of a selected
population. A polymorphic locus may be as small as one base
pair. Polymorphic markers include restriction fragment length
polymorphisms, variable number of tandem repeats (VNTRs),
hypervariable regions, minisatellites, dinucleotide repeats,
trinucleotide repeats, tetranucleotide repeats, simple
sequence repeats, and insertion elements such as Alu. The
first identified allelic form is arbitrarily designated as
the reference form and other allelic forms are designated as
alternative or variant alleles. The allelic form occurring
most frequently in a selected population is sometimes referred
to as the wildtype form. Diploid organisms may be homozygous
or heterozygous for allelic forms. A diallelic polymorphism
has two forms. A triallelic polymorphism has three forms.

A single nucleotide polymorphism occurs at a
polymorphic site occupied by a single nucleotide, which is the
site of variation between allelic sequences. The site is
usually preceded by and followed by highly conserved sequences
of the allele (e.g., sequences that vary in less than 1/100 or
1/1000 members of the populations).

A single nucleotide polymorphism usually arises due to
substitution of one nucleotide for another at the polymorphic
site. A transition is the replacement of one purine by another
purine or one pyrimidine by another pyrimidine. A
transversion is the replacement of a purine by a pyrimidine or
vice versa. Single nucleotide polymorphisms can also arise
from a deletion of a nucleotide or an insertion of a
nucleotide relative to a reference allele.
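
The transition/transversion distinction defined above can be
illustrated with a short Python sketch. It is offered only as an
illustration and is not part of the original disclosure; the
function name classify_substitution is hypothetical, and the
treatment of U as equivalent to T is an assumption made so that
RNA bases can be passed in as well.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(ref: str, alt: str) -> str:
    """Classify a single-base substitution as 'transition' or 'transversion'."""
    # Normalize case and map the RNA base U onto T (an assumption of this sketch).
    ref = ref.upper().replace("U", "T")
    alt = alt.upper().replace("U", "T")
    valid = PURINES | PYRIMIDINES
    if ref not in valid or alt not in valid:
        raise ValueError("bases must be one of A, C, G, T (or U)")
    if ref == alt:
        raise ValueError("reference and alternative bases are identical")
    # Purine<->purine or pyrimidine<->pyrimidine is a transition.
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


if __name__ == "__main__":
    print(classify_substitution("A", "G"))  # transition
    print(classify_substitution("A", "C"))  # transversion
```

Insertions and deletions relative to the reference allele fall
outside this classification and are deliberately rejected here.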

Hybridizations are usually performed under stringent
conditions, for example, at a salt concentration of no more
than 1 M and a temperature of at least 25°C. For example,
conditions of 5X SSPE (750 mM NaCl, 50 mM NaPhosphate, 5 mM
EDTA, pH 7.4) and a temperature of 25-30°C are suitable for
allele-specific probe hybridizations.

An isolated nucleic acid means an object species
that is the predominant species present (i.e., on a
molar basis it is more abundant than any other individual
species in the composition). Preferably, an isolated nucleic
acid comprises at least about 50, 80 or 90 percent (on a molar
basis) of all macromolecular species present. Most
preferably, the object species is purified to essential
homogeneity (contaminant species cannot be detected in the
composition by conventional detection methods).

Linkage disequilibrium or allelic association means
the preferential association of a particular allele or genetic
marker with a specific allele or genetic marker at a nearby
chromosomal location more frequently than expected by chance
for any particular allele frequency in the population. For
example, if locus X has alleles a and b, which occur equally
frequently, and linked locus Y has alleles c and d, which
occur equally frequently, one would expect the combination ac
to occur with a frequency of 0.25. If ac occurs more
frequently, then alleles a and c are in linkage
disequilibrium. Linkage disequilibrium may result from
natural selection of certain combinations of alleles or because
an allele has been introduced into a population too recently
to have reached equilibrium with linked alleles.
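
The comparison in the example above (expected frequency 0.25
versus the observed frequency of the ac combination) can be
written as a small Python sketch. It is an illustration only.
The difference it returns is the conventional disequilibrium
coefficient D, a standard population-genetics quantity that the
text does not name; the function name and the observed value
0.40 are assumptions chosen for the example.

```python
def linkage_disequilibrium(freq_a: float, freq_c: float, freq_ac: float) -> float:
    """Return D = observed frequency of 'ac' minus the frequency expected by chance."""
    for name, value in (("freq_a", freq_a), ("freq_c", freq_c), ("freq_ac", freq_ac)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie between 0 and 1")
    # Under independence the combination occurs with the product of the
    # two allele frequencies (0.5 * 0.5 = 0.25 in the text's example).
    expected = freq_a * freq_c
    return freq_ac - expected


if __name__ == "__main__":
    # Alleles a and c each occur with frequency 0.5; 'ac' is observed at 0.40.
    d = linkage_disequilibrium(0.5, 0.5, 0.40)
    print(f"D = {d:.2f}")  # a positive D means 'ac' is over-represented
```

A value of zero corresponds to the equilibrium case described in
the text; a positive value indicates that a and c occur together
more often than chance predicts.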

A marker in linkage disequilibrium can be particularly
useful in detecting susceptibility to disease (or other
phenotype) notwithstanding that the marker does not cause the
disease. For example, a marker (X) that is not itself a
causative element of a disease, but which is in linkage
disequilibrium with a gene (including regulatory sequences)
(Y) that is a causative element of a phenotype, can be
detected to indicate susceptibility to the disease in
circumstances in which the gene Y may not have been identified
or may not be readily detectable.

The present invention includes the use of any of the
polymorphic forms shown in Table 1 as a means to determine
susceptibility to a phenotype resulting from an allele or
marker in linkage disequilibrium with such polymorphic forms.

DESCRIPTION 

I. Novel Polymorphisms of the Invention

The novel polymorphisms of the invention are listed in
Table 1. The first column of the Table lists the names
assigned to the fragments in which the polymorphisms occur.
The fragments are all human genomic fragments. SGC, TIGR and
WI respectively stand for Stanford Genome Center, The
Institute for Genomic Research and the Whitehead Institute.
The sequence of one allelic form of each of the fragments
(arbitrarily referred to as the prototypical or reference



WO 98/58529    PCT/US98/12930



form) has been previously published. These sequences are
listed at http://www-genome.wi.mit.edu/ (all STS's (sequence
tag sites)); http://shgc.stanford.edu (Stanford STS's); and
http://www.tigr.org/ (TIGR STS's). The Web sites also list
primers for amplification of the fragments, and the genomic
location of fragments. Some fragments are expressed sequence
tags, and some are random genomic fragments. All information
in the websites concerning the fragments listed in Table 1 is
incorporated by reference in its entirety for all purposes.

The second column lists the position in the fragment in
which a polymorphic site has been found. Positions are
numbered consecutively with the first base of the fragment
sequence as listed in one of the above databases being
assigned the number one. The third column lists the base
occupying the polymorphic site in the sequence in the
database. This base is arbitrarily designated the reference or
prototypical form but is not necessarily the most frequently
occurring form. The fifth column in the table lists the
alternative base(s) at the polymorphic site. The eighth
column of the Table lists about 15 bases of sequence on either
side of the polymorphic site in each fragment. The indicated
sequences can be either DNA or RNA. In the latter, the T's
shown in the Table are replaced by U's. The base occupying
the polymorphic site is indicated in IUPAC-IUB ambiguity code.
The fourth and sixth columns of the table show the frequency
with which reference and alternative alleles occur at a
polymorphic site. The seventh column in the table indicates
the population frequency of heterozygotes of the polymorphic
site. Also provided is a nucleic acid encoding hepatic lipase
containing a polymorphism. The sequence is
CTTCGAGAGAGATTGMACAGATTCCTGGAAG.
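
The flanking sequences in column 8 mark the polymorphic site
with a single IUPAC-IUB ambiguity code. As a minimal sketch,
and not part of the original disclosure, the following Python
expands such a sequence into one explicit sequence per allele
and optionally converts it to RNA. The function names
allelic_forms and to_rna are hypothetical; the lookup table is
the standard IUPAC-IUB nucleotide code; the example input is the
hepatic lipase sequence given above.

```python
from typing import List

# Standard IUPAC-IUB nucleotide ambiguity codes.
IUPAC = {
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def allelic_forms(flank: str) -> List[str]:
    """Expand the one ambiguity code in a flanking sequence into explicit alleles."""
    seq = flank.upper()
    sites = [i for i, base in enumerate(seq) if base in IUPAC]
    if len(sites) != 1:
        raise ValueError("expected exactly one ambiguity code marking the site")
    i = sites[0]
    # One output sequence per base permitted by the ambiguity code.
    return [seq[:i] + base + seq[i + 1:] for base in IUPAC[seq[i]]]


def to_rna(seq: str) -> str:
    """Return the RNA form of a DNA sequence (T replaced by U)."""
    return seq.upper().replace("T", "U")


if __name__ == "__main__":
    for allele in allelic_forms("CTTCGAGAGAGATTGMACAGATTCCTGGAAG"):
        print(allele, to_rna(allele))
```

The sketch does not say which of the two expanded forms is the
reference; that information comes from column 3 of Table 1.
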

II. Analysis of Polymorphisms

A. Preparation of Samples 

Polymorphisms are detected in a target nucleic acid 
from an individual being analyzed. For assay of genomic DNA, 
virtually any biological sample (other than pure red blood 
cells) is suitable. For example, convenient tissue samples 
include whole blood, semen, saliva, tears, urine, fecal 
material, sweat, buccal, skin and hair. For assay of cDNA or 
mRNA, the tissue sample must be obtained from an organ in 
which the target nucleic acid is expressed. For example, if 
the target nucleic acid is a cytochrome P450, the liver is a 
suitable source. 

Many of the methods described below require
amplification of DNA from target samples. This can be
accomplished by, e.g., PCR. See generally PCR Technology:
Principles and Applications for DNA Amplification (ed. H.A.
Erlich, Freeman Press, NY, NY, 1992); PCR Protocols: A Guide
to Methods and Applications (eds. Innis, et al., Academic
Press, San Diego, CA, 1990); Mattila et al., Nucleic Acids
Res. 19, 4967 (1991); Eckert et al., PCR Methods and
Applications 1, 17 (1991); PCR (eds. McPherson et al., IRL
Press, Oxford); and U.S. Patent 4,683,202 (each of which is
incorporated by reference for all purposes).

Other suitable amplification methods include the
ligase chain reaction (LCR) (see Wu and Wallace, Genomics 4,
560 (1989), Landegren et al., Science 241, 1077 (1988)),
transcription amplification (Kwoh et al., Proc. Natl. Acad.
Sci. USA 86, 1173 (1989)), self-sustained sequence
replication (Guatelli et al., Proc. Nat. Acad. Sci. USA 87,
1874 (1990)), and nucleic acid based sequence amplification
(NASBA). The latter two amplification methods involve
isothermal reactions based on isothermal transcription, which
produce both single stranded RNA (ssRNA) and double stranded
DNA (dsDNA) as the amplification products in a ratio of about
30 or 100 to 1, respectively.

B. Detection of Polymorphisms in Target DNA

There are two distinct types of analysis depending on
whether a polymorphism in question has already been
characterized. The first type of analysis is sometimes
referred to as de novo characterization. This analysis
compares target sequences in different individuals to identify
points of variation, i.e., polymorphic sites. By analyzing a
group of individuals representing the greatest ethnic
diversity among humans and greatest breed and species variety
in plants and animals, patterns characteristic of the most
common alleles/haplotypes of the locus can be identified, and
the frequencies of such alleles/haplotypes in the population
determined. Additional allelic frequencies can be determined
for subpopulations characterized by criteria such as
geography, race, or gender. The de novo identification of the
polymorphisms of the invention is described in the Examples
section. The second type of analysis is determining which
form(s) of a characterized polymorphism are present in
individuals under test. There are a variety of suitable
procedures, which are discussed in turn.

1. Allele-Specific Probes

The design and use of allele-specific probes for
analyzing polymorphisms is described by, e.g., Saiki et al.,
Nature 324, 163-166 (1986); Dattagupta, EP 235,726; Saiki, WO
89/11548. Allele-specific probes can be designed that
hybridize to a segment of target DNA from one individual but
do not hybridize to the corresponding segment from another
individual due to the presence of different polymorphic forms
in the respective segments from the two individuals.
Hybridization conditions should be sufficiently stringent that
there is a significant difference in hybridization intensity
between alleles, and preferably an essentially binary
response, whereby a probe hybridizes to only one of the
alleles. Some probes are designed to hybridize to a segment
of target DNA such that the polymorphic site aligns with a
central position (e.g., in a 15 mer at the 7 position; in a 16
mer, at either the 8 or 9 position) of the probe. This design
of probe achieves good discrimination in hybridization between
different allelic forms.
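
As an illustration of the central-position design described
above, the following Python sketch cuts one probe-length window
per allele so that the polymorphic site sits in the middle of
the window. It is a sketch under stated assumptions, not the
applicants' procedure: the function name centered_probes is
hypothetical, the site is placed at 0-based offset length // 2
(index 7 of a 15 mer, index 8 of a 16 mer), and the two input
sequences are the A and C forms of the hepatic lipase sequence
listed in Section I (its ambiguity code M denotes A or C).

```python
from typing import List


def centered_probes(allele_seqs: List[str], site_index: int, length: int = 15) -> List[str]:
    """Return one window per allele with the polymorphic site at offset length // 2."""
    offset = length // 2           # 0-based position of the site within the probe
    start = site_index - offset
    end = start + length
    probes = []
    for seq in allele_seqs:
        if start < 0 or end > len(seq):
            raise ValueError("not enough flanking sequence for this probe length")
        probes.append(seq[start:end])
    return probes


if __name__ == "__main__":
    alleles = [
        "CTTCGAGAGAGATTGAACAGATTCCTGGAAG",  # A form
        "CTTCGAGAGAGATTGCACAGATTCCTGGAAG",  # C form
    ]
    # The polymorphic site is at 0-based index 15 of each 31-base sequence.
    for probe in centered_probes(alleles, site_index=15, length=15):
        print(probe)
    # GAGATTGAACAGATT
    # GAGATTGCACAGATT
```

The windows are taken from the strand as written; a probe
intended to hybridize to that strand would be the reverse
complement of the window. Melting temperature and secondary
structure, which matter in practice, are not considered.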

Allele-specific probes are often used in pairs, one
member of a pair showing a perfect match to a reference form
of a target sequence and the other member showing a perfect
match to a variant form. Several pairs of probes can then be
immobilized on the same support for simultaneous analysis of
multiple polymorphisms within the same target sequence.
2. Tiling Arrays 

The polymorphisms can also be identified by
hybridization to nucleic acid arrays, some examples of which
are described by WO 95/11995 (incorporated by reference in its
entirety for all purposes). One form of such arrays is
described in the Examples section in connection with de novo
identification of polymorphisms. The same array or a
different array can be used for analysis of characterized
polymorphisms. WO 95/11995 also describes subarrays that are
optimized for detection of variant forms of a
precharacterized polymorphism. Such a subarray contains
probes designed to be complementary to a second reference
sequence, which is an allelic variant of the first reference
sequence. The second group of probes is designed by the same
principles as described in the Examples except that the probes
exhibit complementarity to the second reference sequence. The
inclusion of a second group (or further groups) can be
particularly useful for analyzing short subsequences of the
primary reference sequence in which multiple mutations are
expected to occur within a short distance commensurate with
the length of the probes (i.e., two or more mutations within 9
to 21 bases).

3. Allele-Specific Primers

An allele-specific primer hybridizes to a site on
target DNA overlapping a polymorphism and only primes
amplification of an allelic form to which the primer exhibits
perfect complementarity. See Gibbs, Nucleic Acid Res. 17,
2427-2448 (1989). This primer is used in conjunction with a
second primer which hybridizes at a distal site.
Amplification proceeds from the two primers leading to a
detectable product signifying the particular allelic form is
present. A control is usually performed with a second pair of
primers, one of which shows a single base mismatch at the
polymorphic site and the other of which exhibits perfect
complementarity to a distal site. The single-base mismatch
prevents amplification and no detectable product is formed.
The method works best when the mismatch is included in the
3'-most position of the oligonucleotide aligned with the
polymorphism because this position is most destabilizing to
elongation from the primer. See, e.g., WO 93/22456.
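
The placement of the polymorphic base at the 3'-most position
can be sketched in Python as follows. This is an illustrative
sketch only: the function name allele_specific_primers is
hypothetical, sequences are assumed to be written 5' to 3', the
primers are taken from the strand as written, and the inputs are
the A and C forms of the hepatic lipase sequence from Section I.
Practical primer design would also weigh melting temperature and
specificity, which are ignored here.

```python
from typing import List


def allele_specific_primers(allele_seqs: List[str], site_index: int, length: int = 16) -> List[str]:
    """Return one primer per allele whose 3'-most base is the polymorphic base."""
    start = site_index - length + 1
    if start < 0:
        raise ValueError("not enough upstream sequence for this primer length")
    # Each primer ends exactly on the polymorphic site, so the primers for
    # different alleles differ only in their final (3'-most) base.
    return [seq[start:site_index + 1] for seq in allele_seqs]


if __name__ == "__main__":
    alleles = [
        "CTTCGAGAGAGATTGAACAGATTCCTGGAAG",  # A form
        "CTTCGAGAGAGATTGCACAGATTCCTGGAAG",  # C form
    ]
    for primer in allele_specific_primers(alleles, site_index=15, length=16):
        print(primer)
    # CTTCGAGAGAGATTGA
    # CTTCGAGAGAGATTGC
```

Each primer is a perfect match for one allele and carries a
single 3'-terminal mismatch against the other, which is the
arrangement the text describes as most destabilizing to
elongation.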

4. Direct Sequencing

The direct analysis of the sequence of polymorphisms
of the present invention can be accomplished using either the
dideoxy chain termination method or the Maxam-Gilbert method
(see Sambrook et al., Molecular Cloning, A Laboratory Manual
(2nd Ed., CSHP, New York 1989); Zyskind et al., Recombinant
DNA Laboratory Manual, (Acad. Press, 1988)).

5. Denaturing Gradient Gel Electrophoresis
Amplification products generated using the polymerase
chain reaction can be analyzed by the use of denaturing
gradient gel electrophoresis. Different alleles can be
identified based on the different sequence-dependent melting
properties and electrophoretic migration of DNA in solution.
Erlich, ed., PCR Technology, Principles and Applications for
DNA Amplification, (W.H. Freeman and Co, New York, 1992),
Chapter 7.

6. Single-Strand Conformation Polymorphism Analysis
Alleles of target sequences can be differentiated
using single-strand conformation polymorphism analysis, which
identifies base differences by alteration in electrophoretic
migration of single stranded PCR products, as described in
Orita et al., Proc. Nat. Acad. Sci. 86, 2766-2770 (1989).
Amplified PCR products can be generated as described above,
and heated or otherwise denatured, to form single stranded
amplification products. Single-stranded nucleic acids may
refold or form secondary structures which are partially
dependent on the base sequence. The different electrophoretic
mobilities of single-stranded amplification products can be
related to base-sequence difference between alleles of target
sequences.

III. Methods of Use 

After determining polymorphic form(s) present in an
individual at one or more polymorphic sites, this information
can be used in a number of methods.
A. Forensics 

Determination of which polymorphic forms occupy a set
of polymorphic sites in an individual identifies a set of
polymorphic forms that distinguishes the individual. See
generally National Research Council, The Evaluation of
Forensic DNA Evidence (Eds. Pollard et al., National Academy
Press, DC, 1996). The more sites that are analyzed, the lower
the probability that the set of polymorphic forms in one
individual is the same as that in an unrelated individual.
Preferably, if multiple sites are analyzed, the sites are
unlinked. Thus, polymorphisms of the invention are often used
in conjunction with polymorphisms in distal genes. Preferred
polymorphisms for use in forensics are diallelic because the
population frequencies of two polymorphic forms can usually be
determined with greater accuracy than those of multiple
polymorphic forms at multi-allelic loci.

The capacity to identify a distinguishing or unique
set of forensic markers in an individual is useful for
forensic analysis. For example, one can determine whether a
blood sample from a suspect matches a blood or other tissue
sample from a crime scene by determining whether the set of
polymorphic forms occupying selected polymorphic sites is the
same in the suspect and the sample. If the set of polymorphic
markers does not match between a suspect and a sample, it can
be concluded (barring experimental error) that the suspect was
not the source of the sample. If the set of markers does
match, one can conclude that the DNA from the suspect is
consistent with that found at the crime scene. If frequencies
of the polymorphic forms at the loci tested have been
determined (e.g., by analysis of a suitable population of
individuals), one can perform a statistical analysis to
determine the probability that a match of suspect and crime
scene sample would occur by chance.

p(ID) is the probability that two random individuals
have the same polymorphic or allelic form at a given
polymorphic site. In diallelic loci, four genotypes are
possible: AA, AB, BA, and BB. If alleles A and B occur in a
haploid genome of the organism with frequencies x and y, the
probabilities of each genotype in a diploid organism are (see
WO 95/12607):

Homozygote: p(AA) = x^2
Homozygote: p(BB) = y^2 = (1-x)^2
Single Heterozygote: p(AB) = p(BA) = xy = x(1-x)
Both Heterozygotes: p(AB+BA) = 2xy = 2x(1-x)

The probability of identity at one locus (i.e., the
probability that two individuals, picked at random from a
population, will have identical polymorphic forms at a given
locus) is given by the equation:

p(ID) = (x^2)^2 + (2xy)^2 + (y^2)^2.

These calculations can be extended for any number of
polymorphic forms at a given locus. For example, the
probability of identity p(ID) for a 3-allele system where the
alleles have the frequencies in the population of x, y and z,
respectively, is equal to the sum of the squares of the
genotype frequencies:

p(ID) = x^4 + (2xy)^2 + (2yz)^2 + (2xz)^2 + z^4 + y^4

In a locus of n alleles, the appropriate binomial
expansion is used to calculate p(ID) and p(exc).

The cumulative probability of identity (cum p(ID)) for
each of multiple unlinked loci is determined by multiplying
the probabilities provided by each locus.

cum p(ID) = p(ID1) p(ID2) p(ID3) ... p(IDn)

The cumulative probability of non-identity for n loci
(i.e., the probability that two random individuals will be
different at 1 or more loci) is given by the equation:

cum p(nonID) = 1 - cum p(ID).
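
The identity calculations above can be checked numerically with
a short Python sketch. It is illustrative only: the function
names are hypothetical, and the code assumes the genotype
frequencies used in the text (x^2, 2xy, y^2, i.e.,
Hardy-Weinberg proportions) and unlinked loci. The same function
handles two, three or more alleles by summing the squared
frequency of every homozygote and every heterozygote.

```python
from itertools import combinations
from math import prod
from typing import List, Sequence


def p_identity(freqs: Sequence[float]) -> float:
    """Probability that two random individuals share a genotype at one locus."""
    if abs(sum(freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    # Squared homozygote frequencies: (x^2)^2 for each allele.
    homozygotes = sum((f * f) ** 2 for f in freqs)
    # Squared heterozygote frequencies: (2xy)^2 for each unordered pair.
    heterozygotes = sum((2 * a * b) ** 2 for a, b in combinations(freqs, 2))
    return homozygotes + heterozygotes


def cumulative_p_identity(loci: List[Sequence[float]]) -> float:
    """Product of the per-locus identity probabilities (unlinked loci assumed)."""
    return prod(p_identity(freqs) for freqs in loci)


def cumulative_p_nonidentity(loci: List[Sequence[float]]) -> float:
    """Probability that two random individuals differ at one or more loci."""
    return 1.0 - cumulative_p_identity(loci)


if __name__ == "__main__":
    one_locus = [0.5, 0.5]
    print(p_identity(one_locus))                    # 0.375
    print(cumulative_p_nonidentity([one_locus] * 20))
```

With x = y = 0.5 a single diallelic locus gives p(ID) = 0.375,
so each added unlinked locus of this kind multiplies the chance
of a coincidental match by 0.375.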

If several polymorphic loci are tested, the cumulative
probability of non-identity for random individuals becomes
very high (e.g., one billion to one). Such probabilities can
be taken into account together with other evidence in
determining the guilt or innocence of the suspect.
B. Paternity Testing 

The object of paternity testing is usually to
determine whether a male is the father of a child. In most
cases, the mother of the child is known and thus, the mother's
contribution to the child's genotype can be traced. Paternity
testing investigates whether the part of the child's genotype
not attributable to the mother is consistent with that of the
putative father. Paternity testing can be performed by
analyzing sets of polymorphisms in the putative father and the
child.

If the set of polymorphisms in the child attributable
to the father does not match the putative father, it can be
concluded, barring experimental error, that the putative
father is not the real father. If the set of polymorphisms in
the child attributable to the father does match the set of
polymorphisms of the putative father, a statistical
calculation can be performed to determine the probability of
coincidental match.

The probability of parentage exclusion (representing
the probability that a random male will have a polymorphic
form at a given polymorphic site that makes him incompatible
as the father) is given by the equation (see WO 95/12607):

p(exc) = xy(1-xy)

where x and y are the population frequencies of alleles A and
B of a diallelic polymorphic site.

(At a triallelic site, p(exc) = xy(1-xy) + yz(1-yz) +
xz(1-xz) + 3xyz(1-xyz), where x, y and z are the respective
population frequencies of alleles A, B and C.)

The probability of non-exclusion is

p(non-exc) = 1 - p(exc)

The cumulative probability of non-exclusion
(representing the value obtained when n loci are used) is
thus:

cum p(non-exc) = p(non-exc1) p(non-exc2) p(non-exc3) ...
p(non-excn)

The cumulative probability of exclusion for n loci
(representing the probability that a random male will be
excluded) is:

cum p(exc) = 1 - cum p(non-exc).
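
As a worked illustration of the exclusion formulas above, the
following Python sketch evaluates p(exc) for diallelic sites and
combines several loci. It is a sketch only: the function names
are hypothetical, each locus is described by the frequency x of
allele A (with y = 1 - x), and the loci are assumed to be
unlinked so that the per-locus probabilities can be multiplied.

```python
from math import prod
from typing import Sequence


def p_exclusion_diallelic(x: float) -> float:
    """p(exc) = xy(1 - xy) for a diallelic site with allele frequencies x and y = 1 - x."""
    if not 0.0 <= x <= 1.0:
        raise ValueError("allele frequency must lie between 0 and 1")
    y = 1.0 - x
    return x * y * (1.0 - x * y)


def cumulative_p_exclusion(allele_freqs: Sequence[float]) -> float:
    """cum p(exc) = 1 - product of the per-locus non-exclusion probabilities."""
    cum_non_exc = prod(1.0 - p_exclusion_diallelic(x) for x in allele_freqs)
    return 1.0 - cum_non_exc


if __name__ == "__main__":
    print(p_exclusion_diallelic(0.5))           # 0.1875
    print(cumulative_p_exclusion([0.5] * 2))    # 0.33984375
    print(cumulative_p_exclusion([0.5] * 30))
```

A single diallelic site with equally frequent alleles excludes a
random male with probability 0.1875, so many such sites are
needed before the cumulative probability of exclusion becomes
high.
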
If several polymorphic loci are included in the
analysis, the cumulative probability of exclusion of a random
male is very high. This probability can be taken into account
in assessing the liability of a putative father whose
polymorphic marker set matches the child's polymorphic marker
set attributable to his/her father.

C. Correlation of Polymorphisms with Phenotypic Traits

The polymorphisms of the invention may contribute to 
the phenotype of an organism in different ways. Some 
polymorphisms occur within a protein coding sequence and 
contribute to phenotype by affecting protein structure. The
effect may be neutral, beneficial or detrimental, or both 
beneficial and detrimental, depending on the circumstances. 
For example, a heterozygous sickle cell mutation confers 
resistance to malaria, but a homozygous sickle cell mutation 
is usually lethal. Other polymorphisms occur in noncoding
regions but may exert phenotypic effects indirectly via 
influence on replication, transcription, and translation. A 
single polymorphism may affect more than one phenotypic trait. 
Likewise, a single phenotypic trait may be affected by 
polymorphisms in different genes. Further, some polymorphisms
predispose an individual to a distinct mutation that is 
causally related to a certain phenotype. 

Phenotypic traits include diseases that have known but
hitherto unmapped genetic components (e.g.,
agammaglobulinemia, diabetes insipidus, Lesch-Nyhan syndrome,
muscular dystrophy, Wiskott-Aldrich syndrome, Fabry's disease,
familial hypercholesterolemia, polycystic kidney disease,
hereditary spherocytosis, von Willebrand's disease, tuberous
sclerosis, hereditary hemorrhagic telangiectasia, familial
colonic polyposis, Ehlers-Danlos syndrome, osteogenesis
imperfecta, and acute intermittent porphyria). Phenotypic
traits also include symptoms of, or susceptibility to,
multifactorial diseases of which a component is or may be
genetic, such as autoimmune diseases, inflammation, cancer,
diseases of the nervous system, and infection by pathogenic
microorganisms. Some examples of autoimmune diseases include
rheumatoid arthritis, multiple sclerosis, diabetes
(insulin-dependent and non-insulin-dependent), systemic lupus
erythematosus and Graves' disease. Some examples of cancers
include cancers of the bladder, brain, breast, colon,
esophagus, kidney, leukemia, liver, lung, oral cavity, ovary,
pancreas, prostate, skin, stomach and uterus. Phenotypic
traits also include characteristics such as longevity,
appearance (e.g., baldness, obesity), strength, speed,
endurance, fertility, and susceptibility or receptivity to
particular drugs or therapeutic treatments.

Correlation is performed for a population of
individuals who have been tested for the presence or absence
of a phenotypic trait of interest and for polymorphic marker
sets. To perform such analysis, the presence or absence of a
set of polymorphisms (i.e., a polymorphic set) is determined
for a set of the individuals, some of whom exhibit a
particular trait, and some of whom lack the trait.
The alleles of each polymorphism of the set are then reviewed
to determine whether the presence or absence of a particular
allele is associated with the trait of interest. Correlation
can be performed by standard statistical methods such as a
chi-squared test, and statistically significant correlations
between polymorphic form(s) and phenotypic characteristics are
noted. For example, it might be found that the presence of
allele A1 at polymorphism A correlates with heart disease. As
a further example, it might be found that the combined
presence of allele A1 at polymorphism A and allele B1 at
polymorphism B correlates with increased milk production of a
farm animal.
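
As a minimal sketch of such a test, the following Python code
computes a Pearson chi-squared statistic for a 2 x 2 table of
allele presence against trait presence. The counts are invented
for illustration, the function name is hypothetical, and no
continuity correction is applied; a real analysis would also
need to account for multiple testing across polymorphisms.

```python
from math import erfc, sqrt


def chi_squared_2x2(a: int, b: int, c: int, d: int):
    """Pearson chi-squared statistic and p-value (1 degree of freedom) for

                      trait   no trait
        allele A1       a        b
        other allele    c        d
    """
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("each row and column needs at least one observation")
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom, P(X >= chi2) = erfc(sqrt(chi2 / 2)).
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: 100 carriers of allele A1 and 100 non-carriers.
chi2, p = chi_squared_2x2(40, 60, 20, 80)
print(f"chi-squared = {chi2:.3f}, p = {p:.4f}")
```

A small p-value indicates that the observed association between
the allele and the trait is unlikely under independence.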

Such correlations can be exploited in several ways. 
In the case of a strong correlation between a set of one or 
more polymorphic forms and a disease for which treatment is
available, detection of the polymorphic form set in a human or
animal patient may justify immediate administration of 
treatment, or at least the institution of regular monitoring
of the patient. Detection of a polymorphic form correlated
with serious disease in a couple contemplating a family may 
also be valuable to the couple in their reproductive 
decisions. For example, the female partner might elect to 
undergo in vitro fertilization to avoid the possibility of
transmitting such a polymorphism from her husband to her 
offspring. In the case of a weaker, but still statistically 
significant correlation between a polymorphic set and human 
disease, immediate therapeutic intervention or monitoring may 
not be justified. Nevertheless, the patient can be motivated
to begin simple life-style changes (e.g., diet, exercise) that 
can be accomplished at little cost to the patient but confer 
potential benefits in reducing the risk of conditions to which 
the patient may have increased susceptibility by virtue of 
variant alleles. Identification of a polymorphic set in a
patient correlated with enhanced receptiveness to one of 
several treatment regimes for a disease indicates that this 
treatment regime should be followed. 

For animals and plants, correlations between
polymorphisms and phenotype are useful for breeding for
desired characteristics. For example, Beitz et al., US
5,292,639 discuss use of bovine mitochondrial polymorphisms in
a breeding program to improve milk production in cows. To
evaluate the effect of mtDNA D-loop sequence polymorphism on
milk production, each cow was assigned a value of 1 if variant
or 0 if wildtype with respect to a prototypical mitochondrial
DNA sequence at each of 17 locations considered. Each
production trait was analyzed individually with the following
animal model:

Y_ijkpn = μ + YS_i + P_j + X_k + β_1 + ... + β_17 + PE_n + a_n + e_p

where Y_ijkpn is the milk, fat, fat percentage, SNF, SNF
percentage, energy concentration, or lactation energy record;
μ is an overall mean; YS_i is the effect common to all cows
calving in year-season; X_k is the effect common to cows in
either the high or average selection line; β_1 to β_17 are the
binomial regressions of production record on mtDNA D-loop
sequence polymorphisms; PE_n is permanent environmental effect
common to all records of cow n; a_n is effect of animal n and






WO 98/58529 



PCT/US98/12930 



is composed of the additive genetic contribution of sire and 
dam breeding values and a Mendelian sampling effect; and e_p is
a random residual. It was found that eleven of seventeen 
polymorphisms tested influenced at least one production trait. 
Bovines having the best polymorphic forms for milk production
at these eleven loci are used as parents for breeding the next 
generation of the herd. 

D. Genetic Mapping of Phenotypic Traits

The previous section concerns identifying correlations
between phenotypic traits and polymorphisms that directly or
indirectly contribute to those traits. The present section
describes identification of a physical linkage between a
genetic locus associated with a trait of interest and
polymorphic markers that are not associated with the trait,
but are in physical proximity with the genetic locus
responsible for the trait and co-segregate with it. Such
analysis is useful for mapping a genetic locus associated with
a phenotypic trait to a chromosomal position, and thereby
cloning gene(s) responsible for the trait. See Lander et al.,
Proc. Natl. Acad. Sci. (USA) 83, 7353-7357 (1986); Lander et
al., Proc. Natl. Acad. Sci. (USA) 84, 2363-2367 (1987);
Donis-Keller et al., Cell 51, 319-337 (1987); Lander et al.,
Genetics 121, 185-199 (1989). Genes localized by linkage can
be cloned by a process known as positional cloning. See
Wainwright, Med. J. Australia 159, 170-174 (1993); Collins,
Nature Genetics 1, 3-6 (1992) (each of which is incorporated
by reference in its entirety for all purposes).

Linkage studies are typically performed on members of
a family. Available members of the family are characterized
for the presence or absence of a phenotypic trait and for a
set of polymorphic markers. The distribution of polymorphic
markers in an informative meiosis is then analyzed to
determine which polymorphic markers co-segregate with a
phenotypic trait. See, e.g., Kerem et al., Science 245,
1073-1080 (1989); Monaco et al., Nature 316, 842 (1985);
Yamoka et al., Neurology 40, 222-226 (1990); Rossiter et al.,
FASEB Journal 5, 21-27 (1991).

Linkage is analyzed by calculation of LOD (log of the
odds) values. A lod value is the relative likelihood of
obtaining observed segregation data for a marker and a genetic
locus when the two are located at a recombination fraction θ,
versus the situation in which the two are not linked, and thus
segregating independently (Thompson & Thompson, Genetics in
Medicine (5th ed, W.B. Saunders Company, Philadelphia, 1991);
Strachan, "Mapping the human genome" in The Human Genome (BIOS
Scientific Publishers Ltd, Oxford), Chapter 4). A series of
likelihood ratios are calculated at various recombination
fractions (θ), ranging from θ = 0.0 (coincident loci) to θ =
0.50 (unlinked). Thus, the likelihood ratio at a given value
of θ is the probability of the data if the loci are linked at θ
divided by the probability of the data if the loci are
unlinked. The computed likelihoods are usually
expressed as the log10 of this ratio (i.e., a lod score).
For example, a lod score of 3 indicates 1000:1 odds against an
apparent observed linkage being a coincidence. The use of
logarithms allows data collected from different families to be
combined by simple addition. Computer programs are available
for the calculation of lod scores for differing values of θ
(e.g., LIPED, MLINK (Lathrop, Proc. Nat. Acad. Sci. (USA) 81,
3443-3446 (1984))). For any particular lod score, a
recombination fraction may be determined from mathematical
tables. See Smith et al., Mathematical tables for research
workers in human genetics (Churchill, London, 1961); Smith,
Ann. Hum. Genet. 32, 127-150 (1968). The value of θ at which
the lod score is the highest is considered to be the best
estimate of the recombination fraction.
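
The lod calculation can be sketched in Python for the simplest
case. This sketch assumes phase-known, fully informative
meioses in which recombinants can be counted directly; it is
not the algorithm of LIPED or MLINK, which handle general
pedigrees. The family counts and function names are
hypothetical.

```python
from math import log10


def lod_score(recombinants: int, meioses: int, theta: float) -> float:
    """log10 of L(theta) / L(0.5) for phase-known, fully informative meioses."""
    if not 0.0 <= theta <= 0.5:
        raise ValueError("theta must lie in [0, 0.5]")
    non_recombinants = meioses - recombinants
    if theta == 0.0 and recombinants > 0:
        return float("-inf")  # any recombinant rules out coincident loci
    # Likelihood of the data if the loci are linked at theta.
    linked = theta ** recombinants * (1 - theta) ** non_recombinants
    # Likelihood of the data if the loci are unlinked (theta = 0.5).
    unlinked = 0.5 ** meioses
    return log10(linked / unlinked)


# Hypothetical families: (recombinants, informative meioses).
families = [(1, 10), (0, 6), (2, 12)]


def combined_lod(theta: float) -> float:
    """Lod scores from different families combine by simple addition."""
    return sum(lod_score(r, n, theta) for r, n in families)


# Evaluate on a grid from 0.00 to 0.50 and keep the theta with the highest lod.
thetas = [i / 100 for i in range(51)]
best_theta = max(thetas, key=combined_lod)
print(best_theta, round(combined_lod(best_theta), 2))
```

The grid search mirrors the procedure described above: lod
scores are computed at a series of recombination fractions, and
the fraction giving the highest combined score is taken as the
best estimate.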

Positive lod score values suggest that the two loci
are linked, whereas negative values suggest that linkage is
less likely (at that value of θ) than the possibility that the
two loci are unlinked. By convention, a combined lod score of
+3 or greater (equivalent to greater than 1000:1 odds in favor
of linkage) is considered definitive evidence that two loci
are linked. Similarly, by convention, a negative lod score of
-2 or less is taken as definitive evidence against linkage of
the two loci being compared. Negative linkage data are useful
in excluding a chromosome or a segment thereof from
consideration. The search focuses on the remaining
non-excluded chromosomal locations.

IV. Modified Polypeptides and Gene Sequences

The invention further provides variant forms of
nucleic acids and corresponding proteins. The nucleic acids
comprise one of the sequences described in Table 1, column 8,
in which the polymorphic position is occupied by one of the
alternative bases for that position. Some nucleic acids encode
full-length variant forms of proteins. Similarly, variant
proteins have the prototypical amino acid sequences encoded
by the nucleic acid sequences shown in Table 1, column 8 (read
so as to be in-frame with the full-length coding sequence of
which it is a component), except at an amino acid encoded by a
codon including one of the polymorphic positions shown in the
Table. That position is occupied by the amino acid coded by
the corresponding codon in any of the alternative forms shown
in the Table.

Variant genes can be expressed in an expression vector in
which a variant gene is operably linked to a native or other
promoter. Usually, the promoter is a eukaryotic promoter for
expression in a mammalian cell. The transcription regulation
sequences typically include a heterologous promoter and
optionally an enhancer which is recognized by the host. The
selection of an appropriate promoter, for example trp, lac,
phage promoters, glycolytic enzyme promoters and tRNA
promoters, depends on the host selected. Commercially
available expression vectors can be used. Vectors can include
host-recognized replication systems, amplifiable genes,
selectable markers, host sequences useful for insertion into
the host genome, and the like.

The means of introducing the expression construct into
a host cell varies depending upon the particular construction
and the target host. Suitable means include fusion,
conjugation, transfection, transduction, electroporation or
injection, as described in Sambrook, supra. A wide variety of
host cells can be employed for expression of the variant gene,
both prokaryotic and eukaryotic. Suitable host cells include
bacteria such as E. coli, yeast, filamentous fungi, insect
cells, mammalian cells, typically immortalized, e.g., mouse,
CHO, human and monkey cell lines and derivatives thereof.
Preferred host cells are able to process the variant gene
product to produce an appropriate mature polypeptide.
Processing includes glycosylation, ubiquitination, disulfide
bond formation, general post-translational modification, and
the like.

The protein may be isolated by conventional means of
protein biochemistry and purification to obtain a
substantially pure product, i.e., 80, 95 or 99% free of cell
component contaminants, as described in Jacoby, Methods in
Enzymology Volume 104, Academic Press, New York (1984);
Scopes, Protein Purification, Principles and Practice, 2nd
Edition, Springer-Verlag, New York (1987); and Deutscher (ed),
Guide to Protein Purification, Methods in Enzymology, Vol. 182
(1990). If the protein is secreted, it can be isolated from
the supernatant in which the host cell is grown. If not
secreted, the protein can be isolated from a lysate of the
host cells.

The invention further provides transgenic nonhuman 
animals capable of expressing an exogenous variant gene and/or 
having one or both alleles of an endogenous variant gene 
inactivated. Expression of an exogenous variant gene is 
usually achieved by operably linking the gene to a promoter
and optionally an enhancer, and microinjecting the construct 
into a zygote. See Hogan et al., "Manipulating the Mouse 
Embryo, A Laboratory Manual," Cold Spring Harbor Laboratory. 
Inactivation of endogenous variant genes can be achieved by 
forming a transgene in which a cloned variant gene is
inactivated by insertion of a positive selection marker. See
Capecchi, Science 244, 1288-1292 (1989). The transgene is 
then introduced into an embryonic stem cell, where it 
undergoes homologous recombination with an endogenous variant 
gene. Mice and other rodents are preferred animals. Such
animals provide useful drug screening systems. 

In addition to substantially full-length polypeptides 
expressed by variant genes, the present invention includes
biologically active fragments of the polypeptides, or analogs
thereof, including organic molecules which simulate the 
interactions of the peptides. Biologically active fragments 
include any portion of the full-length polypeptide which 
confers a biological function on the variant gene product, 
including ligand binding, and antibody binding. Ligand 
binding includes binding by nucleic acids, proteins or 
polypeptides, small biologically active molecules, or large 
cellular structures. 

Polyclonal and/or monoclonal antibodies that 
specifically bind to variant gene products but not to 
corresponding prototypical gene products are also provided. 
Antibodies can be made by injecting mice or other animals with 
the variant gene product or synthetic peptide fragments 
thereof. Monoclonal antibodies are screened as described,
for example, in Harlow & Lane, Antibodies, A Laboratory
Manual, Cold Spring Harbor Press, New York (1988); Goding,
Monoclonal antibodies, Principles and Practice (2d ed.) 
Academic Press, New York (1986). Monoclonal antibodies are 
tested for specific immunoreactivity with a variant gene 
product and lack of immunoreactivity to the corresponding 
prototypical gene product. These antibodies are useful in 
diagnostic assays for detection of the variant form, or as an 
active ingredient in a pharmaceutical composition. 



V. Kits 

The invention further provides kits comprising at
least one allele-specific oligonucleotide as described above.
Often, the kits contain one or more pairs of allele-specific
oligonucleotides hybridizing to different forms of a
polymorphism. In some kits, the allele-specific
oligonucleotides are provided immobilized to a substrate. For
example, the same substrate can comprise allele-specific
oligonucleotide probes for detecting at least 10, 100 or all
of the polymorphisms shown in Table 1. Optional additional
components of the kit include, for example, restriction
enzymes, reverse transcriptase or polymerase, the substrate
nucleoside triphosphates, means used to label (for example, an
avidin-enzyme conjugate and enzyme substrate and chromogen if
the label is biotin), and the appropriate buffers for reverse
transcription, PCR, or hybridization reactions. Usually, the
kit also contains instructions for carrying out the methods.

VI. Computer Systems For Storing Polymorphism Data

Fig. 1A depicts a block diagram of a computer system 
10 suitable for implementing the present invention. Computer 
system 10 includes a bus 12 which interconnects major 
subsystems such as a central processor 14, a system memory 16 
(typically RAM), an input/output (I/O) controller 18, an
external device such as a display screen 24 via a display 
adapter 26, serial ports 28 and 30, a keyboard 32, a fixed 
disk drive 34 via a storage interface 35 and a floppy disk 
drive 36 operative to receive a floppy disk 38, and a CD-ROM 
(or DVD-ROM) device 40 operative to receive a CD-ROM 42. Many
other devices can be connected such as a user pointing device, 
e.g., a mouse 44 connected via serial port 28 and a network 
interface 46 connected via serial port 30. 

Many other devices or subsystems (not shown) may be 
connected in a similar manner. Also, it is not necessary for
all of the devices shown in Fig. 1A to be present to practice 
the present invention, as discussed below. The devices and 
subsystems may be interconnected in different ways from that 
shown in Fig. 1A. The operation of a computer system such as 
that shown in Fig. 1A is well known. Databases storing
polymorphism information according to the present invention
can be stored, e.g., in system memory 16 or on storage media 
such as fixed disk 34, floppy disk 38, or CD-ROM 42. An 
application program to access such databases can be operably 
disposed in system memory 16 or stored on storage media such
as fixed disk 34, floppy disk 38, or CD-ROM 42. 

Fig. 1B depicts the interconnection of computer system
10 to remote computers 48, 50, and 52. Fig. 1B depicts a
network 54 interconnecting remote servers 48, 50, and 52.
Network interface 46 provides the connection from client
computer system 10 to network 54. Network 54 can be, e.g.,
the Internet. Protocols for exchanging data via the Internet
and other networks are well known. Information identifying
the polymorphisms described herein can be transmitted across
network 54 embedded in signals capable of traversing the
physical media employed by network 54.

Information identifying polymorphisms shown in Table 1
is represented in records, which optionally are subdivided
into fields. Each record stores information relating to a
different polymorphism in Table 1. Collectively, the records
can store information relating to all of the polymorphisms in
Table 1, or any subset thereof, such as 5, 10, 50, or 100
polymorphisms from Table 1. In some databases, the
information identifies a base occupying a polymorphic position
and the location of the polymorphic position. The base can be
represented as a single letter code (i.e., A, C, G or T/U)
present in a polymorphic form other than that in the reference
allele. Alternatively, the base occupying a polymorphic site
can be represented in IUPAC ambiguity code as shown in Table
1. The location of a polymorphic site can be identified as
its position within one of the sequences shown in Table 1.
For example, in the first sequence shown in Table 1, the
polymorphic site occupies the 16th base. The position can
also be identified by reference to, for example, a chromosome,
and distance from known markers within the chromosome. In
other databases, information identifying a polymorphism
contains sequences of 10-100 bases shown in Table 1 or the
complements thereof, including a polymorphic site.
Preferably, such information records at least 10, 15, 20, or
30 contiguous bases of sequences including a polymorphic site.
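
One way such a record might be laid out is sketched below in
Python. The class name, field names and the example fragment
identifier are hypothetical and chosen only for illustration;
the specification does not prescribe a schema. The ambiguity
codes are the standard IUPAC nucleotide codes.

```python
from dataclasses import dataclass

# Standard IUPAC ambiguity codes for sets of two or more bases.
IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D", frozenset("ACT"): "H",
    frozenset("ACG"): "V", frozenset("ACGT"): "N",
}


@dataclass(frozen=True)
class PolymorphismRecord:
    fragment: str             # identifier of the fragment (hypothetical field)
    position: int             # 1-based position of the polymorphic site
    reference_base: str       # base in the reference allele
    alternative_bases: tuple  # base(s) in the other polymorphic form(s)

    def iupac_code(self) -> str:
        """Single-letter ambiguity code covering all forms at the site."""
        return IUPAC[frozenset((self.reference_base, *self.alternative_bases))]


# Invented example record; the values are not taken from Table 1.
record = PolymorphismRecord("example-fragment", 16, "A", ("G",))

# A database can be as simple as a mapping keyed by fragment and position.
database = {(record.fragment, record.position): record}
print(database[("example-fragment", 16)].iupac_code())
```

Keying the records by fragment and position corresponds to
identifying a polymorphism by its location within a sequence;
a key based on chromosome and distance from known markers
would serve the alternative described above.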



EXAMPLES 

The polymorphisms shown in Table 1 were identified by
resequencing of target sequences from eight unrelated
individuals of diverse ethnic and geographic backgrounds by
hybridization to probes immobilized to microfabricated arrays.
The strategy and principles for design and use of such arrays
are generally described in WO 95/11995. The strategy provides
arrays of probes for analysis of target sequences showing a
high degree of sequence identity to the reference sequences of
the fragments shown in Table 1, column 1. The reference
sequences were sequence-tagged sites (STSs) developed in the
course of the Human Genome Project (see, e.g., Science 270,
1945-1954 (1995); Nature 380, 152-154 (1996)). Most STSs
ranged from 100 bp to 300 bp in size.

A typical probe array used in this analysis has two
groups of four sets of probes that respectively tile both
strands of a reference sequence. A first probe set comprises
a plurality of probes exhibiting perfect complementarity with
one of the reference sequences. Each probe in the first probe
set has an interrogation position that corresponds to a
nucleotide in the reference sequence. That is, the
interrogation position is aligned with the corresponding
nucleotide in the reference sequence, when the probe and
reference sequence are aligned to maximize complementarity
between the two. For each probe in the first set, there are
three corresponding probes from three additional probe sets.
Thus, there are four probes corresponding to each nucleotide
in the reference sequence. The probes from the three
additional probe sets are identical to the corresponding probe
from the first probe set except at the interrogation position,
which occurs in the same position in each of the four
corresponding probes from the four probe sets, and is occupied
by a different nucleotide in the four probe sets. In the
present analysis, probes were 25 nucleotides long. Arrays
tiled for multiple different reference sequences were
included on the same substrate.
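
The four-probe tiling described above can be sketched in Python
as follows. The sketch assumes that the interrogation position
is the central base of each probe and tiles only one strand;
the function names are hypothetical, and the actual array
design is described in WO 95/11995.

```python
BASES = "ACGT"
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def tile_probes(reference: str, probe_length: int = 25):
    """Yield (reference_index, probes) for each interrogated nucleotide.

    probes maps the base at the interrogation position to the probe
    sequence. One of the four probes is perfectly complementary to the
    reference; the other three differ only at the interrogation position.
    """
    if probe_length % 2 == 0:
        raise ValueError("an odd probe length is assumed (central interrogation)")
    half = probe_length // 2
    for centre in range(half, len(reference) - half):
        window = reference[centre - half:centre + half + 1]
        perfect = reverse_complement(window)  # perfect-match probe
        probes = {b: perfect[:half] + b + perfect[half + 1:] for b in BASES}
        yield centre, probes


# Invented 40-nucleotide reference sequence for illustration.
reference = "ACGT" * 10
for centre, probes in tile_probes(reference):
    print(centre, probes["A"])
```

Tiling the second strand would amount to calling the same
function on the reverse complement of the reference.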

Multiple target sequences from an individual were
amplified from human genomic DNA using primers for the
fragments indicated in the listed Web sites. The amplified
target sequences were fluorescently labelled during or after
PCR. The labelled target sequences were hybridized with a
substrate bearing immobilized arrays of probes. The amount of
label bound to probes was measured. Analysis of the pattern
of label revealed the nature and position of differences
between the target and reference sequence. For example,
comparison of the intensities of four corresponding probes
reveals the identity of a corresponding nucleotide in the
target sequences aligned with the interrogation position of
the probes. The corresponding nucleotide is the complement of
the nucleotide occupying the interrogation position of the
probe showing the highest intensity (see WO 95/11995). The
existence of a polymorphism is also manifested by differences
in normalized hybridization intensities of probes flanking the
polymorphism when the probes hybridized to corresponding
targets from different individuals. For example, relative
loss of hybridization intensity in a "footprint" of probes
flanking a polymorphism signals a difference between the
target and reference (i.e., a polymorphism) (see EP 717,113,
incorporated by reference in its entirety for all purposes).
Additionally, hybridization intensities for corresponding
targets from different individuals can be classified into
groups or clusters suggested by the data, not defined a
priori, such that isolates in a given cluster tend to be
similar and isolates in different clusters tend to be
dissimilar. See WO 97/29212 (incorporated by reference in its
entirety for all purposes). Hybridizations to samples from
different individuals were performed separately. Table 1
summarizes the data obtained for target sequences in
comparison with a reference sequence for the eight individuals
tested.
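
The base-calling rule stated above (the target nucleotide is
the complement of the interrogation base of the brightest of
four corresponding probes) can be sketched in Python. The
intensity values, the function name and the min_ratio
ambiguity threshold are hypothetical additions for
illustration and are not taken from the specification.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def call_base(intensities: dict, min_ratio: float = 1.2):
    """Call the target nucleotide from four corresponding probe intensities.

    intensities maps the base at each probe's interrogation position to
    the measured signal. Returns None when the brightest probe does not
    exceed the runner-up by min_ratio (a hypothetical ambiguity filter).
    """
    ranked = sorted(intensities, key=intensities.get, reverse=True)
    best, runner_up = ranked[0], ranked[1]
    if intensities[best] < min_ratio * intensities[runner_up]:
        return None
    # The target base is the complement of the brightest probe's base.
    return COMPLEMENT[best]


# Invented intensities: the probe with G at the interrogation position
# is brightest, so the target nucleotide is called as C.
print(call_base({"A": 120.0, "C": 15.0, "G": 900.0, "T": 22.0}))
```

Comparing such calls across individuals at each position of a
reference sequence is one simple way to flag candidate
polymorphic sites for closer inspection.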

From the foregoing, it is apparent that the invention
includes a number of general uses that can be expressed
concisely as follows. The invention provides for the use of
any of the nucleic acid segments described above in the
diagnosis or monitoring of diseases, such as cancer,
inflammation, heart disease, diseases of the CNS, and
susceptibility to infection by microorganisms. The invention
further provides for the use of any of the nucleic acid
segments in the manufacture of a medicament for the treatment
or prophylaxis of such diseases. The invention further
provides for the use of any of the DNA segments as a
pharmaceutical.

All publications and patent applications cited above
are incorporated by reference in their entirety for all
purposes to the same extent as if each individual publication
or patent application were specifically and individually
indicated to be so incorporated by reference. Although the
present invention has been described in some detail by way of
illustration and example for purposes of clarity and
understanding, it will be apparent that certain changes and
modifications may be practiced within the scope of the
appended claims.

WHAT IS CLAIMED IS:

1. A nucleic acid segment of between 10 and 100
bases from a fragment shown in Table 1 including a polymorphic
site, or the complement of the segment.

2. The nucleic acid segment of claim 1 that is DNA.

3. The nucleic acid segment of claim 1 that is RNA.

4. The segment of claim 1 that is less than 50 bases.

5. The segment of claim 1 that is less than 20 bases.

6. The segment of claim 1, wherein the fragment is
WI-14263 and the polymorphic site is at position 49.

7. The segment of claim 1, wherein the polymorphic
site is diallelic.

8. The segment of claim 1, wherein the polymorphic
form occupying the polymorphic site is the reference base for
the fragment listed in Table 1, column 3.

9. The segment of claim 1, wherein the polymorphic
form occupying the polymorphic site is an alternative form for
the fragment listed in Table 1, column 5.

10. An allele-specific oligonucleotide that
hybridizes to a segment of a fragment shown in Table 1, column
8 or its complement.

11. The allele-specific oligonucleotide of claim 10
that is a probe.

12. The allele-specific oligonucleotide of claim 10,
wherein a central position of the probe aligns with the
polymorphic site of the fragment.

13. The allele-specific oligonucleotide of claim 10
that is a primer.

14. The allele-specific oligonucleotide of claim 13,
wherein the 3' end of the primer aligns with the polymorphic
site of the fragment.

15. An isolated nucleic acid comprising a sequence of
Table 1, column 8 or the complement thereof, wherein the
polymorphic site within the sequence or complement is occupied
by a base other than the reference base shown in Table 1,
column 3.

16. A method of analyzing a nucleic acid, comprising:
obtaining the nucleic acid from an individual; and
determining a base occupying any one of the polymorphic
sites shown in Table 1.

17. The method of claim 16, wherein the determining
comprises determining a set of bases occupying a set of the
polymorphic sites shown in Table 1.

18. The method of claim 16, wherein the nucleic acid
is obtained from a plurality of individuals, and a base
occupying one of the polymorphic positions is determined in
each of the individuals, and the method further comprising
testing each individual for the presence of a disease
phenotype, and correlating the presence of the disease
phenotype with the base.



19. A computer-readable storage medium for storing
data for access by an application program being executed on a
data processing system, comprising:
a data structure stored in the computer-readable
storage medium, the data structure including information
resident in a database used by the application program and
including:
a plurality of records, each record of the
plurality comprising information identifying a polymorphism
shown in Table 1.

20. The computer-readable storage medium of claim 19,
wherein each record has a field identifying a base occupying a
polymorphic site and a location of the polymorphic site.

21. The computer-readable storage medium of claim 19,
wherein each record identifies a nucleic acid segment of
between 10 and 100 bases from a fragment shown in Table 1
including a polymorphic site, or the complement of the
segment.

22. The computer-readable storage medium of claim 19,
comprising at least 10 records, each record comprising
information identifying a different polymorphism shown in
Table 1.

23. The computer-readable storage medium of claim 19,
comprising at least 100 records, each record comprising
information identifying a different polymorphism shown in
Table 1.

24. A signal carrying data for access by an
application program being executed on a data processing
system, comprising:
a data structure encoded in the signal, said data
structure including information resident in a database used by
the application program and including:
a plurality of records, each record of the plurality
comprising information identifying a polymorphism shown in
Table 1.